miss rate
X-SYCON: Xylem-Inspired Passive Gradient Control for Communication-Free Swarm Response in Dynamic Disaster Environments
Baek, Arthur Ji Sung, Martin, Geoffrey
We present X-SYCON, a xylem-inspired multi-agent architecture in which coordination emerges from passive field dynamics rather than explicit planning or communication. Incidents (demands) and obstructions (hazards) continually write diffusing and decaying scalar fields, and agents greedily ascend a local utility $U=\phi_{\mathrm{DE}}-\kappa\,\phi_{\mathrm{HZ}}$ with light anti-congestion and separation. A beaconing rule triggered on first contact temporarily deepens the local demand sink, accelerating completion without reducing time-to-first-response. Across dynamic, partially blocked simulated environments, we observe low miss rates and stable throughput with interpretable, tunable trade-offs over carrier count, arrival rate, hazard density, and hazard sensitivity $\kappa$. We derive that a characteristic hydraulic length scale $\ell\approx\sqrt{D/\lambda}$ predicts recruitment range in a continuum approximation, and we provide a work-conservation (Ohm-law) bound consistent with sublinear capacity scaling with team size. Empirically: (i) soft hazard penalties yield fewer misses when obstacles already block motion; (ii) throughput saturates sublinearly with carriers while reliability improves sharply; (iii) stronger arrivals can reduce misses by sustaining sinks that recruit help; and (iv) phase-stability regions shrink with hazard density but are recovered by more carriers or higher arrivals. We refer to X-SYCON as an instance of Distributed Passive Computation and Control, and we evaluate it in simulations modeling communication-denied disaster response and other constrained sensing-action regimes.
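As a rough illustration of the two quantities named in this abstract, the sketch below computes the length scale $\ell\approx\sqrt{D/\lambda}$ and takes one greedy step on the utility $U=\phi_{\mathrm{DE}}-\kappa\,\phi_{\mathrm{HZ}}$. It is not the authors' implementation: the grid discretization, the 4-neighbour move set, and the names `length_scale`, `utility`, and `greedy_step` are assumptions made for illustration, and the anti-congestion, separation, and beaconing terms are omitted.

```python
import math


def length_scale(D, lam):
    """Characteristic length sqrt(D / lambda) of a diffusing, decaying field."""
    return math.sqrt(D / lam)


def utility(phi_de, phi_hz, kappa):
    """Local utility U = phi_DE - kappa * phi_HZ at a single cell."""
    return phi_de - kappa * phi_hz


def greedy_step(pos, phi_de, phi_hz, kappa):
    """Move to the 4-neighbour with strictly higher utility, else stay put.

    phi_de and phi_hz are 2D lists of equal shape (hypothetical grid fields).
    """
    rows, cols = len(phi_de), len(phi_de[0])
    r, c = pos
    best = pos
    best_u = utility(phi_de[r][c], phi_hz[r][c], kappa)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            u = utility(phi_de[nr][nc], phi_hz[nr][nc], kappa)
            if u > best_u:
                best, best_u = (nr, nc), u
    return best
```

With `D = 4.0` and `lam = 1.0` the length scale is 2 grid units. Raising `kappa` makes cells with a high hazard field less attractive, which is the hazard-sensitivity trade-off the abstract describes.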
From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
Bayat, Farima Fatahi, Pezeshkpour, Pouya, Hruschka, Estevam
Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use: the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: https://github.com/megagonlabs/TIM.
Are Large Language Models a Good Replacement of Taxonomies?
Sun, Yushi, Xin, Hao, Sun, Kai, Xu, Yifan Ethan, Yang, Xiao, Dong, Xin Luna, Tang, Nan, Chen, Lei
Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while performing poorly on long-tail nuanced knowledge, the community is still doubtful about whether traditional knowledge graphs should be replaced by LLMs. In this paper, we ask if the schema of a knowledge graph (i.e., its taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are common to people. Unfortunately, there is no comprehensive benchmark that evaluates LLMs over a wide range of taxonomies, from common to specialized domains and at levels from root to leaf, from which a confident conclusion could be drawn. To narrow this research gap, we constructed a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains, with in-depth experiments on entities at different levels of each taxonomy from root to leaf. Our comprehensive experiments on eighteen state-of-the-art LLMs under three prompting settings validate that LLMs still cannot capture the knowledge of specialized taxonomies and leaf-level entities well.
Evaluating LLMs for Gender Disparities in Notable Persons
Rhue, Lauren, Goethals, Sofie, Sundararajan, Arun
This study examines the use of Large Language Models (LLMs) for retrieving factual information, addressing concerns over their propensity to produce factually incorrect "hallucinated" responses or to decline to answer a prompt at all. Specifically, it investigates the presence of gender-based biases in LLMs' responses to factual inquiries. This paper takes a multi-pronged approach to evaluating GPT models by assessing fairness across multiple dimensions of recall, hallucinations and declinations. Our findings reveal discernible gender disparities in the responses generated by GPT-3.5. While advancements in GPT-4 have led to improvements in performance, they have not fully eradicated these gender disparities, notably in instances where responses are declined. The study further explores the origins of these disparities by examining the influence of gender associations in prompts and the homogeneity in the responses.
t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators
Kinnunen, Tomi, Lee, Kong Aik, Tak, Hemlata, Evans, Nicholas, Nautsch, Andreas
Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a \emph{set} of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the \emph{concurrent} t-EER, a unique operating point which is invariable to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.
LMR: Lane Distance-Based Metric for Trajectory Prediction
Schmidt, Julian, Monninger, Thomas, Jordan, Julian, Dietmayer, Klaus
The development of approaches for trajectory prediction requires metrics to validate and compare their performance. Currently established metrics are based on Euclidean distance, which means that errors are weighted equally in all directions. Euclidean metrics are insufficient for structured environments like roads, since they do not properly capture the agent's intent relative to the underlying lane. In order to provide a reasonable assessment of trajectory prediction approaches with regard to the downstream planning task, we propose a new metric that is lane distance-based: Lane Miss Rate (LMR). For the calculation of LMR, the ground-truth and predicted endpoints are assigned to lane segments, more precisely their centerlines. Measured by the distance along the lane segments, predictions that are within a certain threshold distance to the ground-truth count as hits; otherwise they count as misses. LMR is then defined as the ratio of sequences that yield a miss. Our results on three state-of-the-art trajectory prediction models show that LMR preserves the order of Euclidean distance-based metrics. In contrast to the Euclidean Miss Rate, qualitative results show that LMR yields misses for sequences where predictions are located on wrong lanes. Hits, on the other hand, result for sequences where predictions are located on the correct lane. This means that LMR implicitly weights Euclidean error relative to the lane and moves toward capturing the intents of traffic agents. The source code of LMR for Argoverse 2 is publicly available.
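The final counting step of LMR described above can be sketched in a few lines. This is only an illustration, not the published Argoverse 2 code: it assumes the along-lane distance between each predicted and ground-truth endpoint has already been computed (the lane assignment itself is not shown), that one distance is given per sequence, and that `math.inf` marks a prediction with no lane path to the ground truth. The function name `lane_miss_rate` and the threshold value are hypothetical.

```python
import math


def lane_miss_rate(lane_distances, threshold):
    """Fraction of sequences whose along-lane endpoint distance exceeds threshold.

    lane_distances: one along-lane distance per sequence (assumed precomputed);
    math.inf can mark a prediction that cannot be reached along the lane graph.
    """
    if not lane_distances:
        raise ValueError("need at least one sequence")
    # A sequence is a hit when its distance is within the threshold, else a miss.
    misses = sum(1 for d in lane_distances if d > threshold)
    return misses / len(lane_distances)


# Hypothetical distances in metres for four sequences, with a 2 m threshold.
example_lmr = lane_miss_rate([0.5, 3.0, 1.0, math.inf], threshold=2.0)
```

Here two of the four sequences exceed the threshold, so `example_lmr` is 0.5.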
Adaptive Caching by Refetching
We are constructing caching policies that have 13–20% lower miss rates than the best of twelve baseline policies over a large variety of request streams. This represents an improvement of 49–63% over Least Recently Used, the most commonly implemented policy. We achieve this not by designing a specific new policy but by using on-line Machine Learning algorithms to dynamically shift between the standard policies based on their observed miss rates. A thorough experimental evaluation of our techniques is given, as well as a discussion of what makes caching an interesting on-line learning problem.
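To make the idea of choosing among standard policies by observed miss rate concrete, here is a minimal sketch. It is not the paper's algorithm: it swaps in a plain "pick the policy with the lowest simulated miss rate" rule instead of an on-line learning method, covers only two policies (LRU and FIFO), and does not model refetching. All class and function names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Least Recently Used: a hit refreshes the key's position."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        """Return True on a hit, False on a miss (the key is then inserted)."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict least recently used
        self.items[key] = None
        return False


class FIFOCache:
    """First In First Out: hits do not change the eviction order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        if key in self.items:
            return True
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict oldest insertion
        self.items[key] = None
        return False


def miss_rates(requests, capacity):
    """Simulate each policy on the same request stream; return its miss rate."""
    policies = {"LRU": LRUCache(capacity), "FIFO": FIFOCache(capacity)}
    misses = {name: 0 for name in policies}
    for key in requests:
        for name, cache in policies.items():
            if not cache.access(key):
                misses[name] += 1
    return {name: count / len(requests) for name, count in misses.items()}


def best_policy(requests, capacity):
    """Name of the policy with the lowest observed miss rate on the stream."""
    rates = miss_rates(requests, capacity)
    return min(rates, key=rates.get)
```

On the stream `["a", "b", "a", "c", "a"]` with capacity 2, LRU misses 3 of 5 requests and FIFO misses 4 of 5, so `best_policy` returns `"LRU"`. An adaptive scheme would repeat this comparison over recent requests and switch when the ranking changes.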
Artificial intelligence excels at catching pre-cancerous cells
Authors of a new study on detecting precancerous polyps in colorectal cancer screening came to the conclusion that "Artificial Intelligence (AI) may detect colorectal polyps that have been missed due to perceptual pitfalls." They go on to say "By reducing such miss rate, Artificial Intelligence may increase the detection of colorectal neoplasia leading to a higher degree of Colorectal Cancer (CRC) prevention." According to a news release, a team of international researchers led by Mayo Clinic reported that AI "reduced by twofold the rate at which pre-cancerous polyps were missed in colorectal cancer screening." The Mayo Clinic defines a colon polyp as "a small clump of cells that forms on the lining of the colon" and says most are harmless. Yet, it cautions, "over time, some colon polyps can develop into colon cancer, which may be fatal when found in its later stages."
AI reduces miss rate of precancerous polyps in colorectal cancer screening
Most colon polyps are harmless, but some over time develop into colon or rectal cancer, which can be fatal if found in its later stages. Colorectal cancer is the second most deadly cancer in the world, with an estimated 1.9 million cases and 916,000 deaths worldwide in 2020, according to the World Health Organization. A colonoscopy is an exam used to detect changes or abnormalities in the large intestine (colon) and rectum. Between February 2020 and May 2021, 230 study participants each underwent two back-to-back colonoscopies on the same day at eight hospitals and community clinics in the U.S., U.K. and Italy. One colonoscopy used AI; the other, a standard colonoscopy, did not. The rate at which precancerous colorectal polyps are missed has been estimated to be 25%.